Correcting delayed reporting of COVID‐19 using the generalized‐Dirichlet‐multinomial method

Abstract The COVID‐19 pandemic has highlighted delayed reporting as a significant impediment to effective disease surveillance and decision‐making. In the absence of timely data, statistical models which account for delays can be adopted to nowcast and forecast cases or deaths. We discuss the four key sources of systematic and random variability in available data for COVID‐19 and other diseases, and critically evaluate current state‐of‐the‐art methods with respect to appropriately separating and capturing this variability. We propose a general hierarchical approach to correcting delayed reporting of COVID‐19 and apply this to daily English hospital deaths, resulting in a flexible prediction tool which could be used to better inform pandemic decision‐making. We compare this approach to competing models with respect to theoretical flexibility and quantitative metrics from a 15‐month rolling prediction experiment imitating a realistic operational scenario. Based on consistent leads in predictive accuracy, bias, and precision, we argue that this approach is an attractive option for correcting delayed reporting of COVID‐19 and future epidemics.

to compare a joint (spatio-temporal) model to independent time series models.

A.1 Data
We use data from the Brazilian state of Paraná, which was severely affected by the 2009 H1N1 epidemic compared to other states (Codeço et al., 2012) and continues to have one of the highest rates of SARI incidence (Bastos et al., 2019). Here we consider a much longer time period of 230 weeks (from the start of January 2013 to the end of May 2017), compared to 66 weeks in Bastos et al. (2019), to enable us to draw meaningful conclusions about seasonal variation. The state is divided into s = {1, . . . , 22} health regions and we consider the total count to be fully observed 6 months after occurrence (D max = 27). The dimension of the total counts y t,s is therefore 230x22 and the dimension of the partial counts z t,s,d is 230x22x27 (corresponding to over 100k observations). For this application, we imagine that the present-day week, t 0 , is week 224 (mid April 2017). We then seek to make predictions for t 0 = 224, for previous weeks where the total count is still partially unobserved (t = {t 0 − D max + 2, . . . , t 0 − 1} = {199, . . . , 223}) and for the next 6 weeks (t = {t 0 + 1, . . . , t 0 + 6} = {225, . . . , 230}).  The plot shows a clear seasonal cycle across all regions, with outbreaks reaching their peak in April-July, as well as considerable year-to-year variability. There is also some evidence of regional variation in the overall rate -for example, the brightest green region tends to have quite a low rate of cases per 100,000 people, compared to some other regions -as well as regional variation in the seasonal timing of outbreaks. At "present day" t 0 = 224, shown by the vertical line, we are in the early stages of the annual influenza outbreak, so forecasting predictions should ideally show an increasing trend in the number of SARI cases.

A.2 Model for SARI cases
In Section 4.2, we introduced a nested spline structure to add a regional dimension to the time series model for dengue fever cases presented in Stoner and Economou (2020). To model the incidence of SARI cases we adopt a near identical approach, making use of spatially-varying intercept, temporal and seasonal effects: f (t, s) = ι s + δ t,s + ξ t,s ; Effects δ t,s and ξ t,s are temporal and seasonal (cyclic) splines, respectively, which vary by region. Figure 1, however, suggests that a large portion of temporal and seasonal variation may be common across all regions. Once again, we can introduce temporal and seasonal effects α t and η t , and make their (basis function) coefficients the mean of the coefficients for the regional effects δ t,s and ξ t,s . Similarly, the model for the expected cumulative proportion reported after each delay is characterised by the addition of fixed delay effects ψ s,d which are independent across regions and temporal spline effects γ t,s : Prior distributions and implementation are the equivalent to the COVID-19 model, detailed in Section 4.3.

A.3 Results
Figure 2 shows median predicted temporal (δ t,s , left) and seasonal (ξ t,s , centre) effects on SARI incidence, as well as the temporal effect on the cumulative proportion reported (γ t,s , right). A different colour is used for each region and the dashed black lines show the median predicted overall effects, α t , η t and β t , respectively.
The estimated effects on SARI incidence follow the overall trends quite closely, with only a few deviating substantially. For example, there are noticeable increases in the temporal effect on SARI incidence for almost all regions around mid 2013 and around mid 2016, corresponding to the two largest outbreaks seen in Figure 1.
Similarly, all the seasonal effects reflect the increase in SARI incidence leading up to Brazil's winter seen in Figure 1. The effects on the cumulative proportion reported are substantially more variable, suggesting that the delay mechanism may be driven more by local factors compared to SARI incidence. Summarising, the 22 health regions of Paraná have a lot in common, in terms of temporal and seasonal variation in SARI incidence. It is worth examining, however, whether anything tangible was gained from modelling the regions simultaneously as opposed to using 22 independent time series models. For instance, modelling the regions independently could impede the model's ability to capture the variance of the total number of reported cases across all regions. To assess this, we applied 22 independent models where f (t) = ι + δ t + ξ t and g(t, d) = ψ d + γ t for each region. We then used posterior predictive checking (Gelman et al., 2014) to see which approach captures the variance of the total better.   For this experiment, the overall 95% prediction interval coverage was 0.99 for predictions of the total reported count corresponding to previous weeks (t < t 0 ), 1 when forecasting (t > t 0 ) and 1 when nowcasting (t = t 0 ). q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q  All code and data required to reproduce these results are provided as supplementary material.

Appendix B Illustration of Data Structure
To support understanding of the reporting delay in the COVID-19 hospital deaths application, this section illustrates the data structure. First, recall from Section 1 that a portion of deaths occurring on a given day (e.g. 7th February) will be reported during the same 24-hour reporting period they occurred in and published the following day (8th February) at 5pm. Some more deaths may be reported in the next 24-hour period, and published the day after that (9th February), and so on until no more deaths are reported. If we compile data files published each day, we can arrange the number of deaths (for each region) by date of death (rows) and reporting date ( Then, assuming a maximum delay of D max = 4 days for this example, we can organise the data in terms of the delay between date of occurrence and date of reporting (publication): Note that there is now a triangular shape of unknown values, shown as "?", which represents the numbers of deaths which are yet to be reported. This is often called the "delay triangle" in the literature. For any date of death where not all delays counts have been observed, the total deaths is also not yet completely observed. The statistical problem is predicting the unknown total deaths, conditional on any partial information for those dates of death, and on any completely observed past total death counts. The final table below shows the same data with modelling notation (where t is the 10th of February): Reporting

Appendix C Simulation Experiment
In the main article, our assessment of the GDM predominantly focuses on out-ofsample prediction performance, reflecting its intended use as a method for operational nowcasting and forecasting. However, we also claimed that the GDM can provide insights into factors determining the structure of the reporting delay and variability in the total count. We test that here by fitting the GDM to simulated data, such that inference can be compared against the known effect of covariates.

C.1 Data
The data we simulate are completely synthetic, but are intended to represent a hypothetical disease epidemic, with covariates imitating real-world drivers of change in the disease progression and reporting performance. Note that repeated simulation would be necessary to assess bias, variance, prediction interval coverage etc. Here we simulate one set of data as an example and we use the phrase "true value(s)" to mean the coefficient values chosen in this one example.
Let y t,s be the number of disease cases occurring on day t ∈ 1, . . . , 100 in region s ∈ 1, 2, 3, and let z t,s,d be the parts of y t,s,d observed at each delay d ∈ 1, . . . , D max .
The experiment focuses on the quality of inference rather than prediction performance, so we assume all y t,s and z t,s,d are known. First, we simulate y t,s from a Negative-Binomial model with mean λ t,s and scale parameter θ s : The mean number of cases is comprised first of a fourth-order orthogonal polynomials in time δ t,s . The polynomials, shown in the left panel of Figure 5, are distinct for each region to reflect differences in disease progression and non-pharmaceutical interventions. In all 3 regions, the simulated polynomials have an "M" shape, representing two waves of the disease. Then, the mean is also affected by the proportion of the population successfully administered a vaccine in each region, V t,s , shown in the right panel of Figure 5. The vaccine becomes available after 40 days, and is then administered at a different rate in each region. After scaling to have 0 mean and standard deviation 1 (let V ( * ) t,s be the scaled version), our imaginary vaccination covariate has coefficient α 1 = −1.75, meaning the case rate reduces substantially as more people are vaccinated. t,s are both assumed constant across regions, meaning we assume the protection of the vaccine and properties of the new variant to be the same across regions.
Combining regional intercept terms ι s , polynomials δ t,s , the effect of vaccination, and the effect prevalence of variant 2, Figure 7 shows the overall mean daily cases for each region (lines) and then the simulated daily cases y t,s (shapes). The M shape from the polynomials is still visible, but dampened by the effect of the vaccination covariate. Now, it remains to simulate z t,s,d : the parts of y t,s reported at each delay. Here, we simulate the first D = 7 delays for modelling. We can ignore the delay structure of the remainder term because, as mentioned previously, we assume all z t,s,d are observed. To define the mean reporting distribution, we start with a probit model for the expected cumulative proportion reported S t,s,d , as in the Survivor variant of the GDM (described in Section 3 of the main article): Temporal variation in S t,s,d is characterised first by monotonically increasing "delay curve" effects ψ s,d and upwards linear trends in (scaled) time X ( * ) t . We also included staff absence percentage covariates A t,s (Figure 8), simulated as first or-der autoregressive variables. The coefficients of the scaled absence percentage A ( * ) t,s , β 2,s , are intuitively negative so that a higher percentage of absences slows down reporting. Finally, we have also included the scaled mean daily number of cases λ ( * ) t,s as a covariate in the cumulative proportion reported. The coefficients of this effect, β 3,s , are also negative such that, when cases are high, reporting slows down. Inclusion of this covariate is notable because: a) to the best of our knowledge, this is the first attempt (albeit for simulated data) to fit a model for delayed reporting which considers the effect of the same covariates on both the mean total cases/deaths and the reporting delay; and b) effectively including the same covariates in both parts of the GDM hierarchy is more likely to cause inferential problems such as non-identifiability of the covariate coefficients. Combining all of the effects in (5) yields the mean cumulative proportion reported at each delay S t,s,d , as shown in Figure 9. Notably, reporting performance visibly slows down in the first 30 days or so as cases surge (as seen in Figure 7). One can imagine this reflecting health systems prioritising treatment over administrative tasks when cases are high.
Using these simulated S t,s,d , we then simulated two different sets of delayed counts z t,s,d , to be fitted separately. We simulated the first set, z Generalised-Dirichlet-Multinomial (GDM) itself: This allows us to assess inference when fitting the model to data generated from it. Then, we also simulated a second set, z We then add an i.i.d. Gaussian noise term ϵ t,s,d ∼ Normal(0, 0.25 2 ) to each u t,s,d and input this into an inverse CLR transformation to produce a new noisier set of proportions µ (2) t,s,d . A standard deviation of 0.25 was chosen for ϵ t,s,d so that the variance of z (2) is fairly close to the variance of z (1) . From these new (noisier) proportions reported at each delay, we can compute relative proportions ν (2) t,s,d : which can then be used as the mean of Binomial conditional distributions to simulate z (2) t,s,d : t,s,j ).
Fitting a GDM model to z (2) t,s,d then allows us to test the flexibility of the GDM to capture non-GDM variance structures, through posterior predictive checking (Gelman et al., 2014).

C.2 Results
We fit the GDM model outlined in equations (3)-(7) to the simulated two data sets.
We use the NIMBLE package (de Valpine et al., 2017) for MCMC and weaklyinformative prior distributions (e.g. Normal(0, 10 2 ) for coefficients), as in the main article.
A key question of interest is whether the known trends and covariate effects in the mean daily cases and in the reporting delay are appropriately captured by the model. To assess this, we should examine outputs from the model fit to z (1) t,s,d . We can first look at the polynomial trends in the mean daily cases. Figure  10 shows the posterior median estimates of δ t,s with 95% credible intervals (CIs).
The polynomials are reproduced very closely, with the true values (dashed lines) captured completely by the 95% CIs.
Next, we can look at the coefficients in the mean daily cases for the vaccine (α 1 ) and for the prevalence of Variant 2 (α 2 ). Figure 11 shows density estimates of the posterior samples of these two coefficients, with vertical lines representing the true values. In these plots, we are looking to see whether the true values are extreme with respect to their corresponding posterior distributions. Here, the true values for both coefficients are towards the centre of the distributions, indicating the models captures the effect of these covariates well.  Then, Figure 12 shows the posterior densities for the cumulative proportion reported coefficients of time (β 1,s ), staff absence (β 2,s ), and the daily case rate (β 3,s ), respectively. In most cases, the true values are well within the bulk of the distributions, and after computing 95% CIs we determined that they all contain their corresponding true values.
Finally, we can assess whether the GDM can appropriately capture the variance of the second set of delay counts z (2) t,s,d . Recall that these were simulated from Binomial-Gaussian mixture distributions, rather than a GDM. To achieve this, we use the MCMC samples to simulate posterior predictive replicates of the original t,s,d . Figure 13 shows the posterior predictive sample standard deviations for the first 6 delays. The true values are all within the bulk of the replicate distributions, indicating the model has captured the variances of the delayed counts well despite being generated from a different model.

C.3 Conclusions
Here we simulated a data set of daily disease case counts in 3 regions, with covariates imitating real-world drivers of disease (vaccination, proliferation of different variants) and reporting delays (staff absences, pressure from high case rates). Investigating the latter effect is particularly unusual, because it means we tested a model design that effectively included the same covariates in both the model for the total counts and the model for the reporting delay. Despite this, the GDM was able to reproduce all of the known covariate effects well. We also tested the fit of the GDM to delay counts z (2) t,s,d simulated from a different model, in this case a Binomial-Gaussian mixture. Compellingly, the flexibility of the GDM meant that it was able to capture the variance of these alternative counts very closely.

COVID-19 Deaths
In Section 4.5 of the main article, we compare two versions of the GDM against four competing models for COVID-19 deaths in a rolling nowcasting experiment.
Here we provide details on the specification and implementation of each of these models.

D.1 GDM Hazard model
With the first version of the GDM being the GDM Survivor model described in Section 4.2, we detail second version based on the Hazard variant of the GDM here. The hierarchical structure is the same as before: y t,s | λ t,s , θ s ∼ Negative-Binomial(λ t,s , θ s ); (11) z t,s | y t,s , ν t,s , ϕ t,s ∼ GDM(ν t,s , ϕ t,s , y t,s ).
The model for the mean daily fatality rate λ t,s is also the same as before: with δ t,s being independent penalised splines of time for each region s, for the purpose of the prediction experiment (as explained in Section 4.5). The difference in the Hazard variant is that the Beta-Binomial means ν t,s,d are modelled directly: Here

D.2 Negative-Binomial Survivor model (NB Survivor)
The first model competing against the GDM is a Negative-Binomial model intended to approximate the (as-yet) unknown marginal model for the delayed counts z t,s,d obtained from the GDM when integrating out the total counts y t,s . The key feature of this model is that the systematic models for the total count and the reporting delay are otherwise identical to those presented in Section 4.2: log(λ t,s ) = ι s + δ t,s ; where µ t,s,d is the mean proportion at each delay, again defined by a probit model for the cumulative proportion reported: and λ t,s is the mean of y t,s . Like in the GDM model for COVID-19 deaths, we include models for the first D ′ = 6 delayed counts z t,s,d , but also an extra (identical) model for the remainder r t,s = y t,s − D ′ d=1 z t,s,d -which is unobserved when y t,s is unobserved (see Stoner and Economou (2020) for the rationale behind modelling the remainder). The model is implemented using MCMC following the specification in Section 4.3 for prior distributions and implementation.

D.3 Negative-Binomial model based on Bastos et al. (INLA)
The second competitor is a Negative-Binomial model for the delay counts z t,s,d based on the framework and implementation detailed in Bastos et al. (2019): Independently for each region, ι s are the intercepts, δ t,s are first order random walks (RW1) (to capture the epidemic curve), β d,s are RW1 effects to capture the mean delay distribution, γ t,s,d are RW1 temporal effects for each delay to capture temporal variation in the delay distribution, and ξ t,s,d are cyclic RW2 effects of day of the week for each delay, to capture weekly cycles in the reporting delay.
The main difference between this model and the above marginal approximation to the GDM is the lack of separation of systematic variability in the total count and the reporting delay. The model is implemented using the Integrated Nested Laplace Approximation method and the r-inla package (Lindgren and Rue, 2015), using code adapted from Bastos et al. (2019).We found that modelling each of the d = 1, . . . , 14 delays separately resulted in extreme 95% prediction interval upper bounds for some data cutoffs (C). We therefore opted to model only the first D = 6 delays, together with a remainder term r t,s (as in the NB Survivor model), which solved the problem.

14)
The third and fourth competitor models are based on the framework for the delayed counts z t,s,d proposed by McGough et al. (2020) and implemented using the NobBS package for R. The package allows for both Poisson and Negative-Binomial models, and we opted for the latter to account for over-dispersion: log(λ t,s,d ) = α t,s + log(β d,s ); where α t,s is an RW1 effect to capture the epidemic curve and β d,s are vectors of proportions such that Dmax d=1 β d,s = 1 (recall the maximum delay D max = 14). Effects β d,s therefore capture the expected proportion reported at each delay and are modelled as fixed in time. To account for temporal heterogeneity in the delay distribution, McGough et al. (2020) proposes fitting the model to data in a moving window of fixed length, so that the estimated β d,s are more appropriate for recent data. We fit two models using this approach, one with the same 70 day window length as we used for the other approaches (NobBS) and one with a shorter window of 14 days , so that the estimated mean delay distribution is representative of the last two weeks. In both cases, we used the default weakly-informative prior distributions from the NobBS package, which implements the model using the JAGS software facility for MCMC (Plummer, 2003).

D.5 Measurement of computation times
To compare the practicality of each approach, we measured computation times needed to run the whole experiment (all 153 cutoff dates), but some care is needed in how these are interpreted because we did not employ the same parallel computing strategies for each method. All models were run on the same Ubuntu Linux Desktop with an Intel Core i9-7900X CPU and 128Gb of system memory (plus a 256Gb swap drive). For the GDM Survivor, GDM Hazard, and NB Survivor approaches, we ran models (with data for all regions) for different cutoff dates simultaneously across 16 CPU threads. We therefore divide the total run time for these by the ceiling function of (153/16). For the INLA models, we ran models sequentially for each region and for each cutoff date, with parallelisation handled automatically by INLA. We therefore divided the total run time for the INLA approach by 153. Finally, for the NobBS models we ran models for each region in parallel across 7 threads (one per region) and for each cutoff date sequentially. As all the models we tested were independent across regions, they all could feasibly have been run in parallel across regions in the same manner as we did for NobBS. Therefore, for a fair comparison of the relative practicality of each model for daily operation use, we divided the total run time for the NobBS models by (153/7).
The resulting indicative run times are presented in the rightmost column of Table   1 in the main article.
weeks (W = 105). To account for the change in the length of the time series, we decreased the number of knots in the splines of time proportionally from 10 to 5 in the models with W = 35 (days), and increased the number of knots proportionally to 15 for W = 105.
To investigate differences in accuracy, precision, and reliability, we computed mean average errors of predictions, mean 95% prediction interval widths, and 95% prediction interval coverage values.

Appendix F Under-reporting
Where count data are affected by delayed-reporting, the total reported count, y t,s,d , is often still a substantial under-representation of the true count, termed here x t,s,d .
Reports of COVID-19 cases may, for instance, be affected by under-reporting due to a lack of testing availability, false negative test results, or a lack of symptoms.
Similarly, some deaths due to COVID-19 may be missed if the patient was not tested and COVID-19 was not specified on the death certificate. To take this into account, Stoner and Economou (2020) presents a comprehensive framework for simultaneously modelling under-reporting and delayed-reporting. Extended here to include a spatial dimension, this is achieved by replacing Equation (1) in Section 3 of the main text with: x t,s | λ t,s , θ s ∼ Negative-Binomial(λ t,s , θ s ); (23) y t,s | x t,s , π t,s ∼ Binomial(π t,s , x t,s ); log π t,s 1 − π t,s = i(t, s), for y t,s ≤ x t,s and where i(t, s) is a general function which may include covariates or random effects, e.g. covariates representing access to COVID-19 tests. The likelihood for y t,s is non-identifiable between a high λ t,s and a low π t,s , or viceversa, so in the case where all available counts are assumed potentially underreported (i.e. x t,s is always unobserved), identifiability can be achieved using prior information .